Customer lifetime Value is the forecasting of net profits that connect to a specific customer during their lifetime relationship with a business. To state it more simply, CLV is the financial value of a customer’s relationship over the lifetime of the organization’s relationship with that customer.

Customer lifetime Value is important because it gives an idea of the amount of repeat business that can be expected from a specific customer. This knowledge assists the organizations in deciding how much they can profitably invest in buying a particular customer for their business.

Once how much a customer buys and the frequency of their purchases is determined, the organizations have a better understanding of how to manage their limited resources.

Problem Statement

To understand the problem of Customer Lifetime Value and provide a data-driven solution through investigating the data and models to explain the relationship between Customer Lifetime Value (Target/Response) and the other variables (Predictors/Explanatory) along with interpreting the visuals, descriptive statistics, tests, and models in terms of the association between the explanatory variables and the response variable.

Policy Profile Variables

Customer Profile Variables

1. Import Libraries

library(tidyverse)
library(plotly)
library(Hmisc)
library(ggplot2)
library(naniar)
library(ggpubr)
library(PerformanceAnalytics)
library(corrplot)
library(plyr)
library(readr)
library(caret)
library(ISLR)
library(mlbench)
library(measures)
library(earth)
library(scales)
library(lmtest)
library(matrixStats)
library(repr)
library(lubridate)
library(olsrr)
library(superml)
library(MASS)
library(phenex)
library(modelr)
library(broom)
library(magrittr)
library(rpart)

2. Loading the dataset

#Loading the dataset
data <- read.csv("Marketing-Customer-Value-Analysis.csv")

3. Understanding the dataset

##    Customer            State             Response           Coverage        
##  Length:9134        Length:9134        Length:9134        Length:9134       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   Education         Effective.To.Date  EmploymentStatus      Gender         
##  Length:9134        Length:9134        Length:9134        Length:9134       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      Income      Location.Code      Marital.Status     Monthly.Premium.Auto
##  Min.   :    0   Length:9134        Length:9134        Min.   : 61.00      
##  1st Qu.:    0   Class :character   Class :character   1st Qu.: 68.00      
##  Median :33890   Mode  :character   Mode  :character   Median : 83.00      
##  Mean   :37657                                         Mean   : 93.22      
##  3rd Qu.:62320                                         3rd Qu.:109.00      
##  Max.   :99981                                         Max.   :298.00      
##  Months.Since.Last.Claim Months.Since.Policy.Inception
##  Min.   : 0.0            Min.   : 0.00                
##  1st Qu.: 6.0            1st Qu.:24.00                
##  Median :14.0            Median :48.00                
##  Mean   :15.1            Mean   :48.06                
##  3rd Qu.:23.0            3rd Qu.:71.00                
##  Max.   :35.0            Max.   :99.00                
##  Number.of.Open.Complaints Number.of.Policies Policy.Type       
##  Min.   :0.0000            Min.   :1.000      Length:9134       
##  1st Qu.:0.0000            1st Qu.:1.000      Class :character  
##  Median :0.0000            Median :2.000      Mode  :character  
##  Mean   :0.3844            Mean   :2.966                        
##  3rd Qu.:0.0000            3rd Qu.:4.000                        
##  Max.   :5.0000            Max.   :9.000                        
##     Policy          Renew.Offer.Type   Sales.Channel      Total.Claim.Amount
##  Length:9134        Length:9134        Length:9134        Min.   :   0.099  
##  Class :character   Class :character   Class :character   1st Qu.: 272.258  
##  Mode  :character   Mode  :character   Mode  :character   Median : 383.945  
##                                                           Mean   : 434.089  
##                                                           3rd Qu.: 547.515  
##                                                           Max.   :2893.240  
##  Vehicle.Class      Vehicle.Size       Customer.Lifetime.Value
##  Length:9134        Length:9134        Min.   : 1898          
##  Class :character   Class :character   1st Qu.: 3994          
##  Mode  :character   Mode  :character   Median : 5780          
##                                        Mean   : 8005          
##                                        3rd Qu.: 8962          
##                                        Max.   :83325

3.1 Frequency distribution of the data

## data 
## 
##  24  Variables      9134  Observations
## --------------------------------------------------------------------------------
## Customer 
##        n  missing distinct 
##     9134        0     9134 
## 
## lowest : AA10041 AA11235 AA16582 AA30683 AA34092
## highest: ZZ77357 ZZ83340 ZZ89380 ZZ91716 ZZ97035
## --------------------------------------------------------------------------------
## State 
##        n  missing distinct 
##     9134        0        5 
## 
## lowest : Arizona    California Nevada     Oregon     Washington
## highest: Arizona    California Nevada     Oregon     Washington
##                                                                  
## Value         Arizona California     Nevada     Oregon Washington
## Frequency        1703       3150        882       2601        798
## Proportion      0.186      0.345      0.097      0.285      0.087
## --------------------------------------------------------------------------------
## Response 
##        n  missing distinct 
##     9134        0        2 
##                       
## Value         No   Yes
## Frequency   7826  1308
## Proportion 0.857 0.143
## --------------------------------------------------------------------------------
## Coverage 
##        n  missing distinct 
##     9134        0        3 
##                                      
## Value         Basic Extended  Premium
## Frequency      5568     2742      824
## Proportion     0.61     0.30     0.09
## --------------------------------------------------------------------------------
## Education 
##        n  missing distinct 
##     9134        0        5 
## 
## lowest : Bachelor             College              Doctor               High School or Below Master              
## highest: Bachelor             College              Doctor               High School or Below Master              
##                                                                          
## Value                  Bachelor              College               Doctor
## Frequency                  2748                 2681                  342
## Proportion                0.301                0.294                0.037
##                                                     
## Value      High School or Below               Master
## Frequency                  2622                  741
## Proportion                0.287                0.081
## --------------------------------------------------------------------------------
## Effective.To.Date 
##        n  missing distinct 
##     9134        0       59 
## 
## lowest : 01/01/11 01/02/11 01/03/11 01/04/11 01/05/11
## highest: 2/24/11  2/25/11  2/26/11  2/27/11  2/28/11 
## --------------------------------------------------------------------------------
## EmploymentStatus 
##        n  missing distinct 
##     9134        0        5 
## 
## lowest : Disabled      Employed      Medical Leave Retired       Unemployed   
## highest: Disabled      Employed      Medical Leave Retired       Unemployed   
##                                                                   
## Value           Disabled      Employed Medical Leave       Retired
## Frequency            405          5698           432           282
## Proportion         0.044         0.624         0.047         0.031
##                         
## Value         Unemployed
## Frequency           2317
## Proportion         0.254
## --------------------------------------------------------------------------------
## Gender 
##        n  missing distinct 
##     9134        0        2 
##                     
## Value         F    M
## Frequency  4658 4476
## Proportion 0.51 0.49
## --------------------------------------------------------------------------------
## Income 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     9134        0     5694    0.984    37657    34641        0        0 
##      .25      .50      .75      .90      .95 
##        0    33890    62320    82104    90374 
## 
## lowest :     0 10037 10074 10097 10105, highest: 99875 99934 99960 99961 99981
## --------------------------------------------------------------------------------
## Location.Code 
##        n  missing distinct 
##     9134        0        3 
##                                      
## Value         Rural Suburban    Urban
## Frequency      1773     5779     1582
## Proportion    0.194    0.633    0.173
## --------------------------------------------------------------------------------
## Marital.Status 
##        n  missing distinct 
##     9134        0        3 
##                                      
## Value      Divorced  Married   Single
## Frequency      1369     5298     2467
## Proportion     0.15     0.58     0.27
## --------------------------------------------------------------------------------
## Monthly.Premium.Auto 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     9134        0      202        1    93.22    33.77     62.0     64.0 
##      .25      .50      .75      .90      .95 
##     68.0     83.0    109.0    129.0    163.4 
## 
## lowest :  61  62  63  64  65, highest: 290 295 296 297 298
## --------------------------------------------------------------------------------
## Months.Since.Last.Claim 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     9134        0       36    0.999     15.1    11.57        1        2 
##      .25      .50      .75      .90      .95 
##        6       14       23       30       33 
## 
## lowest :  0  1  2  3  4, highest: 31 32 33 34 35
## --------------------------------------------------------------------------------
## Months.Since.Policy.Inception 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     9134        0      100        1    48.06    32.21        4       10 
##      .25      .50      .75      .90      .95 
##       24       48       71       87       93 
## 
## lowest :  0  1  2  3  4, highest: 95 96 97 98 99
## --------------------------------------------------------------------------------
## Number.of.Open.Complaints 
##        n  missing distinct     Info     Mean      Gmd 
##     9134        0        6    0.498   0.3844   0.6588 
## 
## lowest : 0 1 2 3 4, highest: 1 2 3 4 5
##                                               
## Value          0     1     2     3     4     5
## Frequency   7252  1011   374   292   149    56
## Proportion 0.794 0.111 0.041 0.032 0.016 0.006
## --------------------------------------------------------------------------------
## Number.of.Policies 
##        n  missing distinct     Info     Mean      Gmd 
##     9134        0        9    0.936    2.966    2.439 
## 
## lowest : 1 2 3 4 5, highest: 5 6 7 8 9
##                                                                 
## Value          1     2     3     4     5     6     7     8     9
## Frequency   3251  2294  1168   409   407   372   433   384   416
## Proportion 0.356 0.251 0.128 0.045 0.045 0.041 0.047 0.042 0.046
## --------------------------------------------------------------------------------
## Policy.Type 
##        n  missing distinct 
##     9134        0        3 
##                                                        
## Value      Corporate Auto  Personal Auto   Special Auto
## Frequency            1968           6788            378
## Proportion          0.215          0.743          0.041
## --------------------------------------------------------------------------------
## Policy 
##        n  missing distinct 
##     9134        0        9 
## 
## lowest : Corporate L1 Corporate L2 Corporate L3 Personal L1  Personal L2 
## highest: Personal L2  Personal L3  Special L1   Special L2   Special L3  
##                                                                            
## Value      Corporate L1 Corporate L2 Corporate L3  Personal L1  Personal L2
## Frequency           359          595         1014         1240         2122
## Proportion        0.039        0.065        0.111        0.136        0.232
##                                                               
## Value       Personal L3   Special L1   Special L2   Special L3
## Frequency          3426           66          164          148
## Proportion        0.375        0.007        0.018        0.016
## --------------------------------------------------------------------------------
## Renew.Offer.Type 
##        n  missing distinct 
##     9134        0        4 
##                                       
## Value      Offer1 Offer2 Offer3 Offer4
## Frequency    3752   2926   1432   1024
## Proportion  0.411  0.320  0.157  0.112
## --------------------------------------------------------------------------------
## Sales.Channel 
##        n  missing distinct 
##     9134        0        4 
##                                                           
## Value            Agent      Branch Call Center         Web
## Frequency         3477        2567        1765        1325
## Proportion       0.381       0.281       0.193       0.145
## --------------------------------------------------------------------------------
## Total.Claim.Amount 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     9134        0     5106        1    434.1      300    52.26   104.08 
##      .25      .50      .75      .90      .95 
##   272.26   383.95   547.51   773.43   960.12 
## 
## lowest :    0.099007    0.382107    0.423310    0.517753    0.769185
## highest: 2345.413441 2452.894264 2552.343856 2759.794354 2893.239678
## --------------------------------------------------------------------------------
## Vehicle.Class 
##        n  missing distinct 
##     9134        0        6 
## 
## lowest : Four-Door Car Luxury Car    Luxury SUV    Sports Car    SUV          
## highest: Luxury Car    Luxury SUV    Sports Car    SUV           Two-Door Car 
##                                                                   
## Value      Four-Door Car    Luxury Car    Luxury SUV    Sports Car
## Frequency           4621           163           184           484
## Proportion         0.506         0.018         0.020         0.053
##                                       
## Value                SUV  Two-Door Car
## Frequency           1796          1886
## Proportion         0.197         0.206
## --------------------------------------------------------------------------------
## Vehicle.Size 
##        n  missing distinct 
##     9134        0        3 
##                                   
## Value        Large Medsize   Small
## Frequency      946    6424    1764
## Proportion   0.104   0.703   0.193
## --------------------------------------------------------------------------------
## Customer.Lifetime.Value 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     9134        0     8041        1     8005     6158     2475     2662 
##      .25      .50      .75      .90      .95 
##     3994     5780     8962    15433    22064 
## 
## lowest :  1898.008  1898.684  1904.001  1918.120  1940.981
## highest: 66025.754 67907.270 73225.957 74228.516 83325.381
## --------------------------------------------------------------------------------

3.2 Check for null values

It is concluded that there are no missing values in the given dataset.

4. Transforming the dataset

## [1] "2011-02-24" "2011-01-31" "2011-02-19" "2011-01-20" "2011-02-03"
## [6] "2011-01-25"

Inside the dataset, the Effective to date column was not in date format. So we transformed the column.

4.1 Creating new columns

Total.Premium.Since.Last.Claim - New feature is created by multiplying Month since last claim, monthly premium auto and number of policies purchased by the customers. this gives the amount that the company is supposed to get from that customer just after the last claim.

Total.Monthly.Premium - This is the product of monthly premium to number of policies purchased by the customer. This is the total amount that a particular customer is expected to pay to the company.

5. Exploratory Data Analysis

5.1 Correlation plot for numerical variables

5.2 Frequency plots for contious variables.

Out of all the policies which policy is a customer more likely to buy?

Most of the customers prefer buying Personal L3 policy which comes under Personal Auto policy type.

How does Monthly Premium auto and Months since last claim affect the Customer Life time value?

Monthly premium auto contributes significantly to the net profit generated by customers for the bank. As the monthly premium of a policy rises the Customer Lifetime Value increases. Customers who haven’t claimed an amount for a long time will be preferred over the one’s who have recently claimed an amount.

Which sales channel is more likely to increase the customer life time value?

The agents have played a major role in promoting and selling insurance policies to customers. The agents strive hard to deliver the best insurance policies and services to their customers and hence are preferred over all other sales channel.

Do the number of policies affect Total claim amount and CLV?

Number of policies does not have a significant impact on the total claim amount but it seems to affect the CLV values. It is seen that customers prefer buying two policies. However, they don’t go beyond buying two because in the end they can claim an amount from just one policy so there’s absolutely no need to buy more policies and pay monthly premium’s for them.

Are men more likely to buy vehicle insurance policies? Is there a gender bias?

There is no gender bias. Both men and women are equally likely to buy Vehicle insurance policies. The ‘ANOVA’ test is performed to understand if there exists a significant difference between the means of the two groups (Male and Female) with respect to the CLV. Since, p-value > 0.05 we fail to reject the null hypothesis and hence there exists no significant difference between the means of the two groups (Male and Female).

## Call:
##    aov(formula = Customer.Lifetime.Value ~ Gender, data = data)
## 
## Terms:
##                       Gender    Residuals
## Sum of Squares      79863451 431090855476
## Deg. of Freedom            1         9132
## 
## Residual standard error: 6870.707
## Estimated effects may be unbalanced
##               Df    Sum Sq  Mean Sq F value Pr(>F)
## Gender         1 7.986e+07 79863451   1.692  0.193
## Residuals   9132 4.311e+11 47206620

What is the response of the customers with regards to the policy plan?

85.7% of the responses recorded are negative.

Does the vehicle size or class have a role to play in increasing the CLV?

Through ANOVA test, we infer that there is no significant difference in the means of the groups of vehicle size with respect to CLV. It is also observed that the luxury class vehicles have higher CLV compared to other vehicle classes.

## Call:
##    aov(formula = Customer.Lifetime.Value ~ Vehicle.Size, data = data)
## 
## Terms:
##                 Vehicle.Size    Residuals
## Sum of Squares     224887989 430945830938
## Deg. of Freedom            2         9131
## 
## Residual standard error: 6869.928
## Estimated effects may be unbalanced
##                Df    Sum Sq   Mean Sq F value Pr(>F)  
## Vehicle.Size    2 2.249e+08 112443995   2.382 0.0924 .
## Residuals    9131 4.309e+11  47195907                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Does the bank resolve the complaints raised by its customers and how does it affect the CLV?

Yes, the bank tries to resolve the complaints raised by its customers. Most of the customers have no pending complaints. We can see that the CLV value decreases as the number of pending complaints increases.

What is the relation between Income of the customer and Customer lifetime value?

Customers who have no income are unemployed but still contribute to high customer lifetime value. This fact cannot be justified from the data provided. Hence additional data on secondary sources of income must be provided to validate the statement.

Most of the unemployed customers are single.

Retired customers have given more positive response as compared to their negative response with regard to purchase of policy plans.They prefer to purchase policies through agents. Majority of them are living in suburban area.

Renewal offer type 4 has complete negative feedback

Sales happening through agents have higher positive purchase response.

Retired, Medical Leave, Disabled and Unemployed customers accept renewal offer 1 where as majority of the employed people accept offer 2

The employment status of customers in all the 5 state doesnot differ significantly, however California and Oregon contribute to highest CLV values.

Unemployed customers have no income whereas all employed customers have high incomes.

Renewal offer 1 contributes to higher CLV values.

More customers reside in suburban region contributing to more CLV.

Doctors are paying low monthly premium amounts compared to other.

Log Distrubution plot for Customer lifetime value.

The Log Transformation is applied on CLV to transform its skewed distribution to approximately normal. This makes CLV more interpretable

  • How to Identify the most profitable customers?

The most profitable customers for the insurance company are those employed with high income. They contribute to high customer lifetime values.

  • How to segment profitable customers?

Profitable customers can be segmented on the basis of months since last claim, monthly premium auto and number of policies purchased.

  • Should the company invest in its marketing strategies?

The response from the customers about the purchase of policy is negative in most cases. It is suggested that if the company invest in their customer service, offers provided, it will help in gaining positive response and hence acquire more customers.

6. Modeling

6.1 Converting Categorical Variables to Factors

All the categorical variables are converted to factors.

6.2 Converting Numerical Variables to Factors

Even though number of open complaints and number of policies are numerical variables, they only have 6 and 9 levels respectively. So they are also converted to factors.

## 'data.frame':    9134 obs. of  24 variables:
##  $ Customer                     : chr  "BU79786" "QZ44356" "AI49188" "WW63253" ...
##  $ State                        : Factor w/ 5 levels "Arizona","California",..: 5 1 3 2 5 4 4 1 4 4 ...
##  $ Response                     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
##  $ Coverage                     : Factor w/ 3 levels "Basic","Extended",..: 1 2 3 1 1 1 1 3 1 2 ...
##  $ Education                    : Factor w/ 5 levels "Bachelor","College",..: 1 1 1 1 1 1 2 5 1 2 ...
##  $ Effective.To.Date            : Date, format: "2011-02-24" "2011-01-31" ...
##  $ EmploymentStatus             : Factor w/ 5 levels "Disabled","Employed",..: 2 5 2 5 2 2 2 5 3 2 ...
##  $ Gender                       : Factor w/ 2 levels "F","M": 1 1 1 2 2 1 1 2 2 1 ...
##  $ Income                       : int  56274 0 48767 0 43836 62902 55350 0 14072 28812 ...
##  $ Location.Code                : Factor w/ 3 levels "Rural","Suburban",..: 2 2 2 2 1 1 2 3 2 3 ...
##  $ Marital.Status               : Factor w/ 3 levels "Divorced","Married",..: 2 3 2 2 3 2 2 3 1 2 ...
##  $ Monthly.Premium.Auto         : int  69 94 108 106 73 69 67 101 71 93 ...
##  $ Months.Since.Last.Claim      : int  32 13 18 18 12 14 0 0 13 17 ...
##  $ Months.Since.Policy.Inception: int  5 42 38 65 44 94 13 68 3 7 ...
##  $ Number.of.Open.Complaints    : Factor w/ 6 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Number.of.Policies           : Factor w/ 9 levels "1","2","3","4",..: 1 8 2 7 1 2 9 4 2 8 ...
##  $ Policy.Type                  : Factor w/ 3 levels "Corporate Auto",..: 1 2 2 1 2 2 1 1 1 3 ...
##  $ Policy                       : Factor w/ 9 levels "Corporate L1",..: 3 6 6 2 4 6 3 3 3 8 ...
##  $ Renew.Offer.Type             : Factor w/ 4 levels "Offer1","Offer2",..: 1 3 1 1 1 2 1 1 1 2 ...
##  $ Sales.Channel                : Factor w/ 4 levels "Agent","Branch",..: 1 1 1 3 1 4 1 1 1 2 ...
##  $ Total.Claim.Amount           : num  385 1131 566 530 138 ...
##  $ Vehicle.Class                : Factor w/ 6 levels "Four-Door Car",..: 6 1 6 5 1 6 1 1 1 1 ...
##  $ Vehicle.Size                 : Factor w/ 3 levels "Large","Medsize",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Customer.Lifetime.Value      : num  2764 6980 12887 7646 2814 ...

6.3 Feature Engineering

Removing ‘Customer’ AND ‘Effective.To.Date’ Variables

##        State Response Coverage Education EmploymentStatus Gender Income
## 1 Washington       No    Basic  Bachelor         Employed      F  56274
## 2    Arizona       No Extended  Bachelor       Unemployed      F      0
## 3     Nevada       No  Premium  Bachelor         Employed      F  48767
## 4 California       No    Basic  Bachelor       Unemployed      M      0
## 5 Washington       No    Basic  Bachelor         Employed      M  43836
## 6     Oregon      Yes    Basic  Bachelor         Employed      F  62902
##   Location.Code Marital.Status Monthly.Premium.Auto Months.Since.Last.Claim
## 1      Suburban        Married                   69                      32
## 2      Suburban         Single                   94                      13
## 3      Suburban        Married                  108                      18
## 4      Suburban        Married                  106                      18
## 5         Rural         Single                   73                      12
## 6         Rural        Married                   69                      14
##   Months.Since.Policy.Inception Number.of.Open.Complaints Number.of.Policies
## 1                             5                         0                  1
## 2                            42                         0                  8
## 3                            38                         0                  2
## 4                            65                         0                  7
## 5                            44                         0                  1
## 6                            94                         0                  2
##      Policy.Type       Policy Renew.Offer.Type Sales.Channel Total.Claim.Amount
## 1 Corporate Auto Corporate L3           Offer1         Agent           384.8111
## 2  Personal Auto  Personal L3           Offer3         Agent          1131.4649
## 3  Personal Auto  Personal L3           Offer1         Agent           566.4722
## 4 Corporate Auto Corporate L2           Offer1   Call Center           529.8813
## 5  Personal Auto  Personal L1           Offer1         Agent           138.1309
## 6  Personal Auto  Personal L3           Offer2           Web           159.3830
##   Vehicle.Class Vehicle.Size Customer.Lifetime.Value
## 1  Two-Door Car      Medsize                2763.519
## 2 Four-Door Car      Medsize                6979.536
## 3  Two-Door Car      Medsize               12887.432
## 4           SUV      Medsize                7645.862
## 5 Four-Door Car      Medsize                2813.693
## 6  Two-Door Car      Medsize                8256.298

The ‘Customer’ variable is removed since it is just ids. For ‘Effective.To.Date’, it only has 2 months, Jan and Feb in the year 2011. So this is also removed.

Binning and Capping

We tried binning three variables Total Claim Amount, Monthly Premium Auto and Income. But, Binning did not help us improve the model performance. Hence, we did not use it in our final model.

We capped Total Claim Amount, Monthly Premium Auto and Income at 0.97 (97th Quartile). Capping too did not help us improve the model performance. Hence, we did not use it in our final model.

Splitting of data into Train and Test data

## [1] 6850   22
## [1] 2284   22

The data is split into train and test with a ratio 3:1

Scaling

Scaling does not significantly impact the regression model. Hence, we have tried but not incorporated it in our modeling.

6.4 Model Building

1. Multiple Linear Regression Model

## 
## Call:
## lm(formula = log(Customer.Lifetime.Value) ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58674 -0.05682 -0.00397  0.03957  0.92721 
## 
## Coefficients: (2 not defined because of singularities)
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    7.335e+00  3.493e-02 209.984  < 2e-16 ***
## StateCalifornia                8.402e-03  7.244e-03   1.160 0.246111    
## StateNevada                    1.258e-02  1.006e-02   1.250 0.211299    
## StateOregon                    1.577e-02  7.522e-03   2.096 0.036110 *  
## StateWashington                6.825e-03  1.041e-02   0.656 0.511927    
## ResponseYes                   -2.530e-04  8.018e-03  -0.032 0.974829    
## CoverageExtended               7.471e-02  9.585e-03   7.794 7.46e-15 ***
## CoveragePremium                9.782e-02  2.011e-02   4.865 1.17e-06 ***
## EducationCollege               4.064e-04  6.589e-03   0.062 0.950824    
## EducationDoctor                3.084e-02  1.404e-02   2.197 0.028044 *  
## EducationHigh School or Below  1.922e-02  6.685e-03   2.875 0.004049 ** 
## EducationMaster                2.624e-02  1.002e-02   2.618 0.008860 ** 
## EmploymentStatusEmployed       5.344e-02  1.386e-02   3.857 0.000116 ***
## EmploymentStatusMedical Leave  3.303e-02  1.686e-02   1.959 0.050170 .  
## EmploymentStatusRetired        7.559e-03  1.924e-02   0.393 0.694422    
## EmploymentStatusUnemployed    -1.470e-02  1.395e-02  -1.054 0.291920    
## GenderM                       -2.445e-02  5.119e-03  -4.776 1.83e-06 ***
## Income                         3.011e-07  1.487e-07   2.024 0.042981 *  
## Location.CodeSuburban         -4.994e-03  1.019e-02  -0.490 0.624146    
## Location.CodeUrban             1.787e-03  9.365e-03   0.191 0.848654    
## Marital.StatusMarried          1.767e-02  7.494e-03   2.358 0.018413 *  
## Marital.StatusSingle          -1.743e-02  8.725e-03  -1.998 0.045745 *  
## Monthly.Premium.Auto           7.113e-03  3.854e-04  18.458  < 2e-16 ***
## Months.Since.Last.Claim        4.622e-04  2.513e-04   1.839 0.065952 .  
## Months.Since.Policy.Inception  7.291e-06  9.157e-05   0.080 0.936538    
## Number.of.Open.Complaints1    -1.739e-02  8.274e-03  -2.102 0.035578 *  
## Number.of.Open.Complaints2    -2.885e-03  1.283e-02  -0.225 0.822148    
## Number.of.Open.Complaints3    -5.925e-02  1.444e-02  -4.103 4.13e-05 ***
## Number.of.Open.Complaints4    -9.907e-02  2.049e-02  -4.836 1.35e-06 ***
## Number.of.Open.Complaints5    -1.287e-01  3.292e-02  -3.910 9.32e-05 ***
## Number.of.Policies2            1.402e+00  6.648e-03 210.811  < 2e-16 ***
## Number.of.Policies3            6.974e-01  8.341e-03  83.609  < 2e-16 ***
## Number.of.Policies4            6.983e-01  1.261e-02  55.355  < 2e-16 ***
## Number.of.Policies5            6.949e-01  1.267e-02  54.850  < 2e-16 ***
## Number.of.Policies6            6.950e-01  1.351e-02  51.438  < 2e-16 ***
## Number.of.Policies7            6.918e-01  1.246e-02  55.542  < 2e-16 ***
## Number.of.Policies8            6.992e-01  1.291e-02  54.161  < 2e-16 ***
## Number.of.Policies9            7.022e-01  1.270e-02  55.272  < 2e-16 ***
## Policy.TypePersonal Auto      -8.892e-03  1.340e-02  -0.663 0.507106    
## Policy.TypeSpecial Auto        2.053e-02  2.317e-02   0.886 0.375717    
## PolicyCorporate L2            -1.935e-02  1.610e-02  -1.202 0.229339    
## PolicyCorporate L3            -7.101e-03  1.480e-02  -0.480 0.631421    
## PolicyPersonal L1             -3.619e-03  7.946e-03  -0.455 0.648811    
## PolicyPersonal L2              2.324e-03  6.760e-03   0.344 0.731063    
## PolicyPersonal L3                     NA         NA      NA       NA    
## PolicySpecial L1              -5.987e-02  3.474e-02  -1.723 0.084870 .  
## PolicySpecial L2               1.811e-03  2.707e-02   0.067 0.946658    
## PolicySpecial L3                      NA         NA      NA       NA    
## Renew.Offer.TypeOffer2         7.872e-03  6.435e-03   1.223 0.221237    
## Renew.Offer.TypeOffer3         9.135e-03  7.693e-03   1.187 0.235099    
## Renew.Offer.TypeOffer4         1.378e-03  9.005e-03   0.153 0.878393    
## Sales.ChannelBranch            9.574e-03  6.331e-03   1.512 0.130539    
## Sales.ChannelCall Center      -2.605e-05  7.161e-03  -0.004 0.997098    
## Sales.ChannelWeb              -3.199e-03  7.931e-03  -0.403 0.686687    
## Total.Claim.Amount             9.080e-06  1.839e-05   0.494 0.621585    
## Vehicle.ClassLuxury Car        2.092e-02  5.331e-02   0.393 0.694685    
## Vehicle.ClassLuxury SUV        5.819e-02  5.292e-02   1.100 0.271555    
## Vehicle.ClassSports Car        1.513e-01  2.039e-02   7.420 1.32e-13 ***
## Vehicle.ClassSUV               1.436e-01  1.763e-02   8.144 4.52e-16 ***
## Vehicle.ClassTwo-Door Car      6.173e-04  6.637e-03   0.093 0.925907    
## Vehicle.SizeMedsize           -5.723e-04  8.404e-03  -0.068 0.945710    
## Vehicle.SizeSmall              2.768e-03  9.794e-03   0.283 0.777497    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2088 on 6790 degrees of freedom
## Multiple R-squared:  0.8986, Adjusted R-squared:  0.8977 
## F-statistic:  1019 on 59 and 6790 DF,  p-value: < 2.2e-16

The above linear model contains all the features.

When we plot the density plot for ‘Customer.Lifetime.Value’, its visible that the graph is skewed towards the left.To tackle this we take log(Customer.Lifetime.Value).

The log transformation reduces or removes the skewness of our original data. Log transformation also de-emphasizes outliers and allows us to potentially obtain a bell-shaped distribution. The idea is that taking the log of the data can restore symmetry to the data.

The adj R-squared value after using log is found to be 0.8977

## [1] "RMSE  3937.60929439273"
## [1] "Adj R2  0.670865413869209"
## [1] "RMSE  4045.75854712155"
## [1] "Adj R2  0.648896880168791"

To predict the RMSE and adj R squared for train and test together we created a function eval_fns. Since log was taken for Customer.Lifetime.Value, exponential of the predict function is taken. Adj R-squared for train is 0.670 which means that 67% of the variation in Customer.Lifetime.Value can be explained by the independent variables included in the model. Adj R-squared for test is 0.649 which means that 64.9% of the variation in Customer.Lifetime.Value can be explained by the independent variables included in the model.

2. Multiple Linear Regression Using Feature Selection

The features are selected based on the importance score and the significant p-values obtained from model_1. We use the method rpart to perform feature selection.Rpart means splitting the dataset recursively, which means that the subsets that arise from a split are further split until a predetermined termination criterion is reached.

## 
## Call:
## lm(formula = log(Customer.Lifetime.Value) ~ +Total.Claim.Amount + 
##     Coverage + Education + EmploymentStatus + Marital.Status + 
##     Monthly.Premium.Auto + Number.of.Open.Complaints + Number.of.Policies + 
##     Policy + Renew.Offer.Type + Vehicle.Class, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56716 -0.05499 -0.00481  0.03973  0.91359 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    7.340e+00  3.186e-02 230.364  < 2e-16 ***
## Total.Claim.Amount            -6.295e-06  1.308e-05  -0.481  0.63029    
## CoverageExtended               7.297e-02  9.584e-03   7.613 3.04e-14 ***
## CoveragePremium                9.336e-02  2.012e-02   4.641 3.53e-06 ***
## EducationCollege               8.248e-04  6.589e-03   0.125  0.90038    
## EducationDoctor                2.981e-02  1.404e-02   2.124  0.03374 *  
## EducationHigh School or Below  1.991e-02  6.677e-03   2.982  0.00287 ** 
## EducationMaster                2.721e-02  1.001e-02   2.719  0.00657 ** 
## EmploymentStatusEmployed       6.193e-02  1.286e-02   4.817 1.49e-06 ***
## EmploymentStatusMedical Leave  3.044e-02  1.686e-02   1.806  0.07100 .  
## EmploymentStatusRetired        1.580e-03  1.880e-02   0.084  0.93301    
## EmploymentStatusUnemployed    -2.253e-02  1.363e-02  -1.653  0.09831 .  
## Marital.StatusMarried          1.871e-02  7.470e-03   2.505  0.01228 *  
## Marital.StatusSingle          -1.731e-02  8.695e-03  -1.991  0.04653 *  
## Monthly.Premium.Auto           7.284e-03  3.809e-04  19.123  < 2e-16 ***
## Number.of.Open.Complaints1    -1.744e-02  8.274e-03  -2.107  0.03513 *  
## Number.of.Open.Complaints2    -3.557e-03  1.283e-02  -0.277  0.78155    
## Number.of.Open.Complaints3    -5.950e-02  1.444e-02  -4.122 3.81e-05 ***
## Number.of.Open.Complaints4    -9.758e-02  2.050e-02  -4.760 1.98e-06 ***
## Number.of.Open.Complaints5    -1.306e-01  3.294e-02  -3.964 7.46e-05 ***
## Number.of.Policies2            1.401e+00  6.650e-03 210.719  < 2e-16 ***
## Number.of.Policies3            6.976e-01  8.334e-03  83.696  < 2e-16 ***
## Number.of.Policies4            6.985e-01  1.261e-02  55.379  < 2e-16 ***
## Number.of.Policies5            6.925e-01  1.266e-02  54.717  < 2e-16 ***
## Number.of.Policies6            6.967e-01  1.349e-02  51.646  < 2e-16 ***
## Number.of.Policies7            6.919e-01  1.245e-02  55.549  < 2e-16 ***
## Number.of.Policies8            6.991e-01  1.291e-02  54.147  < 2e-16 ***
## Number.of.Policies9            6.985e-01  1.270e-02  55.002  < 2e-16 ***
## PolicyCorporate L2            -1.848e-02  1.612e-02  -1.147  0.25161    
## PolicyCorporate L3            -6.032e-03  1.481e-02  -0.407  0.68387    
## PolicyPersonal L1             -1.142e-02  1.445e-02  -0.790  0.42950    
## PolicyPersonal L2             -6.086e-03  1.383e-02  -0.440  0.65997    
## PolicyPersonal L3             -7.974e-03  1.341e-02  -0.595  0.55200    
## PolicySpecial L1              -3.989e-02  3.151e-02  -1.266  0.20558    
## PolicySpecial L2               2.579e-02  2.282e-02   1.130  0.25857    
## PolicySpecial L3               2.073e-02  2.320e-02   0.893  0.37164    
## Renew.Offer.TypeOffer2         7.375e-03  6.283e-03   1.174  0.24052    
## Renew.Offer.TypeOffer3         8.304e-03  7.610e-03   1.091  0.27520    
## Renew.Offer.TypeOffer4        -1.070e-03  8.829e-03  -0.121  0.90356    
## Vehicle.ClassLuxury Car        8.885e-03  5.333e-02   0.167  0.86768    
## Vehicle.ClassLuxury SUV        4.446e-02  5.290e-02   0.840  0.40071    
## Vehicle.ClassSports Car        1.453e-01  2.038e-02   7.129 1.11e-12 ***
## Vehicle.ClassSUV               1.389e-01  1.763e-02   7.878 3.83e-15 ***
## Vehicle.ClassTwo-Door Car      1.107e-03  6.636e-03   0.167  0.86747    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2092 on 6806 degrees of freedom
## Multiple R-squared:  0.898,  Adjusted R-squared:  0.8973 
## F-statistic:  1393 on 43 and 6806 DF,  p-value: < 2.2e-16

This model includes features from the importance score and p-values. here Adj R2 is found to 0.8973.

## [1] "RMSE  3945.04179126583"
## [1] "Adj R2  0.669621714777904"
## [1] "RMSE  4043.91530696999"
## [1] "Adj R2  0.64921673116145"

Adj R-squared for train is 0.669 which means that 66.9% of the variation in Customer.Lifetime.Value can be explained by the independent variables included in the model. Adj R-squared for test is 0.649 which means that 64.9% of the variation in Customer.Lifetime.Value can be explained by the independent variables included in the model.

3. Polynomial Regression Model

## 
## Call:
## lm(formula = log(Customer.Lifetime.Value) ~ +poly(Total.Claim.Amount, 
##     3, raw = TRUE) + Coverage + Education + EmploymentStatus + 
##     Marital.Status + poly(Monthly.Premium.Auto, 3, raw = TRUE) + 
##     Number.of.Open.Complaints + Number.of.Policies + Policy + 
##     Renew.Offer.Type + Vehicle.Class, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42721 -0.02933 -0.00126  0.02021  0.86818 
## 
## Coefficients:
##                                              Estimate Std. Error t value
## (Intercept)                                 6.461e+00  6.515e-02  99.167
## poly(Total.Claim.Amount, 3, raw = TRUE)1    1.298e-05  4.134e-05   0.314
## poly(Total.Claim.Amount, 3, raw = TRUE)2   -2.065e-08  5.999e-08  -0.344
## poly(Total.Claim.Amount, 3, raw = TRUE)3    4.660e-12  2.194e-11   0.212
## CoverageExtended                           -1.840e-02  1.060e-02  -1.736
## CoveragePremium                            -4.328e-02  2.127e-02  -2.035
## EducationCollege                           -1.917e-03  6.430e-03  -0.298
## EducationDoctor                             3.027e-02  1.370e-02   2.209
## EducationHigh School or Below               1.572e-02  6.520e-03   2.412
## EducationMaster                             2.659e-02  9.768e-03   2.722
## EmploymentStatusEmployed                    5.722e-02  1.257e-02   4.553
## EmploymentStatusMedical Leave               2.156e-02  1.646e-02   1.310
## EmploymentStatusRetired                     3.524e-03  1.839e-02   0.192
## EmploymentStatusUnemployed                 -2.728e-02  1.332e-02  -2.049
## Marital.StatusMarried                       1.634e-02  7.292e-03   2.241
## Marital.StatusSingle                       -1.863e-02  8.490e-03  -2.194
## poly(Monthly.Premium.Auto, 3, raw = TRUE)1  2.623e-02  1.457e-03  17.997
## poly(Monthly.Premium.Auto, 3, raw = TRUE)2 -1.000e-04  1.005e-05  -9.948
## poly(Monthly.Premium.Auto, 3, raw = TRUE)3  1.524e-07  2.137e-08   7.133
## Number.of.Open.Complaints1                 -1.551e-02  8.075e-03  -1.920
## Number.of.Open.Complaints2                 -6.288e-03  1.252e-02  -0.502
## Number.of.Open.Complaints3                 -5.942e-02  1.409e-02  -4.219
## Number.of.Open.Complaints4                 -9.348e-02  2.001e-02  -4.672
## Number.of.Open.Complaints5                 -1.234e-01  3.214e-02  -3.839
## Number.of.Policies2                         1.399e+00  6.489e-03 215.616
## Number.of.Policies3                         6.946e-01  8.134e-03  85.396
## Number.of.Policies4                         6.941e-01  1.231e-02  56.395
## Number.of.Policies5                         6.905e-01  1.235e-02  55.898
## Number.of.Policies6                         6.939e-01  1.316e-02  52.716
## Number.of.Policies7                         6.916e-01  1.215e-02  56.915
## Number.of.Policies8                         6.953e-01  1.260e-02  55.174
## Number.of.Policies9                         6.931e-01  1.240e-02  55.919
## PolicyCorporate L2                         -2.302e-02  1.574e-02  -1.463
## PolicyCorporate L3                         -5.855e-03  1.447e-02  -0.405
## PolicyPersonal L1                          -1.238e-02  1.413e-02  -0.876
## PolicyPersonal L2                          -9.157e-03  1.352e-02  -0.677
## PolicyPersonal L3                          -9.434e-03  1.310e-02  -0.720
## PolicySpecial L1                           -4.344e-02  3.078e-02  -1.412
## PolicySpecial L2                            1.870e-02  2.229e-02   0.839
## PolicySpecial L3                            2.154e-02  2.265e-02   0.951
## Renew.Offer.TypeOffer2                      1.008e-02  6.133e-03   1.644
## Renew.Offer.TypeOffer3                      1.161e-02  7.429e-03   1.564
## Renew.Offer.TypeOffer4                      3.609e-03  8.621e-03   0.419
## Vehicle.ClassLuxury Car                    -3.369e-02  5.403e-02  -0.624
## Vehicle.ClassLuxury SUV                     4.669e-03  5.390e-02   0.087
## Vehicle.ClassSports Car                    -1.085e-02  2.216e-02  -0.489
## Vehicle.ClassSUV                           -2.117e-02  1.975e-02  -1.072
## Vehicle.ClassTwo-Door Car                   2.595e-03  6.476e-03   0.401
##                                            Pr(>|t|)    
## (Intercept)                                 < 2e-16 ***
## poly(Total.Claim.Amount, 3, raw = TRUE)1   0.753465    
## poly(Total.Claim.Amount, 3, raw = TRUE)2   0.730712    
## poly(Total.Claim.Amount, 3, raw = TRUE)3   0.831790    
## CoverageExtended                           0.082674 .  
## CoveragePremium                            0.041857 *  
## EducationCollege                           0.765604    
## EducationDoctor                            0.027186 *  
## EducationHigh School or Below              0.015907 *  
## EducationMaster                            0.006503 ** 
## EmploymentStatusEmployed                   5.38e-06 ***
## EmploymentStatusMedical Leave              0.190244    
## EmploymentStatusRetired                    0.848025    
## EmploymentStatusUnemployed                 0.040520 *  
## Marital.StatusMarried                      0.025040 *  
## Marital.StatusSingle                       0.028246 *  
## poly(Monthly.Premium.Auto, 3, raw = TRUE)1  < 2e-16 ***
## poly(Monthly.Premium.Auto, 3, raw = TRUE)2  < 2e-16 ***
## poly(Monthly.Premium.Auto, 3, raw = TRUE)3 1.09e-12 ***
## Number.of.Open.Complaints1                 0.054844 .  
## Number.of.Open.Complaints2                 0.615404    
## Number.of.Open.Complaints3                 2.49e-05 ***
## Number.of.Open.Complaints4                 3.04e-06 ***
## Number.of.Open.Complaints5                 0.000125 ***
## Number.of.Policies2                         < 2e-16 ***
## Number.of.Policies3                         < 2e-16 ***
## Number.of.Policies4                         < 2e-16 ***
## Number.of.Policies5                         < 2e-16 ***
## Number.of.Policies6                         < 2e-16 ***
## Number.of.Policies7                         < 2e-16 ***
## Number.of.Policies8                         < 2e-16 ***
## Number.of.Policies9                         < 2e-16 ***
## PolicyCorporate L2                         0.143547    
## PolicyCorporate L3                         0.685756    
## PolicyPersonal L1                          0.380833    
## PolicyPersonal L2                          0.498214    
## PolicyPersonal L3                          0.471393    
## PolicySpecial L1                           0.158131    
## PolicySpecial L2                           0.401420    
## PolicySpecial L3                           0.341567    
## Renew.Offer.TypeOffer2                     0.100295    
## Renew.Offer.TypeOffer3                     0.117980    
## Renew.Offer.TypeOffer4                     0.675501    
## Vehicle.ClassLuxury Car                    0.532923    
## Vehicle.ClassLuxury SUV                    0.930983    
## Vehicle.ClassSports Car                    0.624506    
## Vehicle.ClassSUV                           0.283957    
## Vehicle.ClassTwo-Door Car                  0.688656    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2041 on 6802 degrees of freedom
## Multiple R-squared:  0.9029, Adjusted R-squared:  0.9023 
## F-statistic:  1346 on 47 and 6802 DF,  p-value: < 2.2e-16

Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y|x). Basically it adds the quadratic or polynomial terms to the regression.

We have chosen polynomial regression as our 3rd model as it provides the best approximation of the relationship between the dependent and independent variables. Also the polynomial regression fits a wide range of curvature.

Adj R-squared value is found out to be 0.9023

## [1] "RMSE  3866.95258117393"
## [1] "Adj R2  0.682571460172236"
## [1] "RMSE  3947.30847379713"
## [1] "Adj R2  0.665776562392871"

Adj R-squared for train is 0.683 which means that 68.3% of the variation in Customer.Lifetime.Value can be explained by the independent variables included in the model. Adj R-squared for test is 0.669 which means that 66.9% of the variation in Customer.Lifetime.Value can be explained by the independent variables included in the model.

4. Weighted Polynomial Regression Model

## 
## Call:
## lm(formula = log(Customer.Lifetime.Value) ~ Number.of.Policies + 
##     Monthly.Premium.Auto + Total.Claim.Amount + Vehicle.Class + 
##     State + Coverage + Education + Gender + EmploymentStatus + 
##     Marital.Status + Months.Since.Last.Claim + Number.of.Open.Complaints + 
##     Policy.Type + Renew.Offer.Type + Sales.Channel + Vehicle.Size + 
##     I(Monthly.Premium.Auto^2) + I(Total.Claim.Amount^2) + I(Months.Since.Last.Claim^2), 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.47117 -0.03463 -0.00363  0.02381  0.87413 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    6.793e+00  4.384e-02 154.965  < 2e-16 ***
## Number.of.Policies2            1.399e+00  6.504e-03 215.167  < 2e-16 ***
## Number.of.Policies3            6.948e-01  8.158e-03  85.161  < 2e-16 ***
## Number.of.Policies4            6.948e-01  1.233e-02  56.355  < 2e-16 ***
## Number.of.Policies5            6.917e-01  1.240e-02  55.765  < 2e-16 ***
## Number.of.Policies6            6.944e-01  1.320e-02  52.595  < 2e-16 ***
## Number.of.Policies7            6.906e-01  1.218e-02  56.691  < 2e-16 ***
## Number.of.Policies8            6.957e-01  1.264e-02  55.049  < 2e-16 ***
## Number.of.Policies9            6.960e-01  1.243e-02  56.011  < 2e-16 ***
## Monthly.Premium.Auto           1.699e-02  6.856e-04  24.782  < 2e-16 ***
## Total.Claim.Amount             2.205e-05  2.351e-05   0.938 0.348269    
## Vehicle.ClassLuxury Car       -1.093e-01  5.274e-02  -2.072 0.038310 *  
## Vehicle.ClassLuxury SUV       -7.798e-02  5.238e-02  -1.489 0.136592    
## Vehicle.ClassSports Car       -1.624e-02  2.215e-02  -0.733 0.463526    
## Vehicle.ClassSUV              -2.413e-02  1.977e-02  -1.220 0.222329    
## Vehicle.ClassTwo-Door Car      1.967e-03  6.494e-03   0.303 0.761978    
## StateCalifornia                9.408e-03  7.076e-03   1.330 0.183713    
## StateNevada                    1.113e-02  9.841e-03   1.130 0.258336    
## StateOregon                    1.721e-02  7.357e-03   2.339 0.019374 *  
## StateWashington                1.054e-02  1.018e-02   1.035 0.300525    
## CoverageExtended              -1.100e-02  1.057e-02  -1.040 0.298169    
## CoveragePremium               -4.427e-02  2.129e-02  -2.079 0.037634 *  
## EducationCollege              -2.539e-03  6.444e-03  -0.394 0.693641    
## EducationDoctor                3.191e-02  1.372e-02   2.325 0.020102 *  
## EducationHigh School or Below  1.572e-02  6.538e-03   2.405 0.016204 *  
## EducationMaster                2.419e-02  9.790e-03   2.471 0.013486 *  
## GenderM                       -2.677e-02  5.003e-03  -5.352 9.01e-08 ***
## EmploymentStatusEmployed       6.419e-02  1.259e-02   5.099 3.51e-07 ***
## EmploymentStatusMedical Leave  2.845e-02  1.649e-02   1.726 0.084461 .  
## EmploymentStatusRetired        1.657e-02  1.844e-02   0.899 0.368815    
## EmploymentStatusUnemployed    -2.128e-02  1.332e-02  -1.598 0.110110    
## Marital.StatusMarried          1.594e-02  7.302e-03   2.183 0.029047 *  
## Marital.StatusSingle          -1.857e-02  8.502e-03  -2.184 0.029024 *  
## Months.Since.Last.Claim       -4.798e-04  9.128e-04  -0.526 0.599186    
## Number.of.Open.Complaints1    -1.711e-02  8.087e-03  -2.116 0.034370 *  
## Number.of.Open.Complaints2    -5.746e-03  1.255e-02  -0.458 0.647106    
## Number.of.Open.Complaints3    -5.915e-02  1.411e-02  -4.193 2.79e-05 ***
## Number.of.Open.Complaints4    -9.633e-02  2.004e-02  -4.808 1.56e-06 ***
## Number.of.Open.Complaints5    -1.247e-01  3.219e-02  -3.873 0.000108 ***
## Policy.TypePersonal Auto       1.210e-03  6.033e-03   0.201 0.841056    
## Policy.TypeSpecial Auto        2.046e-02  1.311e-02   1.560 0.118692    
## Renew.Offer.TypeOffer2         1.151e-02  6.203e-03   1.856 0.063530 .  
## Renew.Offer.TypeOffer3         1.208e-02  7.464e-03   1.619 0.105513    
## Renew.Offer.TypeOffer4         4.929e-03  8.714e-03   0.566 0.571647    
## Sales.ChannelBranch            6.204e-03  6.177e-03   1.004 0.315178    
## Sales.ChannelCall Center      -3.693e-03  6.993e-03  -0.528 0.597483    
## Sales.ChannelWeb              -5.127e-03  7.738e-03  -0.663 0.507627    
## Vehicle.SizeMedsize            1.463e-03  8.218e-03   0.178 0.858707    
## Vehicle.SizeSmall              1.005e-03  9.567e-03   0.105 0.916345    
## I(Monthly.Premium.Auto^2)     -2.965e-05  1.758e-06 -16.866  < 2e-16 ***
## I(Total.Claim.Amount^2)       -1.923e-08  1.574e-08  -1.222 0.221739    
## I(Months.Since.Last.Claim^2)   3.202e-05  2.661e-05   1.203 0.228914    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2044 on 6798 degrees of freedom
## Multiple R-squared:  0.9027, Adjusted R-squared:  0.902 
## F-statistic:  1237 on 51 and 6798 DF,  p-value: < 2.2e-16

## 
##  studentized Breusch-Pagan test
## 
## data:  model_1.0
## BP = 2780.7, df = 51, p-value < 2.2e-16

The residuals are not distributed with equal variance throughout the plot.

Heteroskedasticity refers to situations where the variance of the residuals is unequal over a range of measured values. When running a regression analysis, heteroskedasticity results in an unequal scatter of the residuals (also known as the error term).

To formally test for heteroscedasticity, we perform a Breusch-Pagan test. The Breusch-Pagan test uses the following null and alternative hypotheses: - Null Hypothesis (H0): Homoscedasticity is present (the residuals are distributed with equal variance) - Alternative Hypothesis (HA): Heteroscedasticity is present (the residuals are not distributed with equal variance) Since the p-value from the test is less than 2.2e-16 we will reject the null hypothesis and conclude that heteroscedasticity is a problem in this model.

Since heteroscedasticity is present, we will perform weighted least squares by defining the weights in such a way that the observations with lower variance are given more weight.

## 
## Call:
## lm(formula = log(Customer.Lifetime.Value) ~ Number.of.Policies + 
##     Monthly.Premium.Auto + Total.Claim.Amount + Vehicle.Class + 
##     State + Coverage + Education + Gender + EmploymentStatus + 
##     Marital.Status + Months.Since.Last.Claim + Number.of.Open.Complaints + 
##     Policy.Type + Renew.Offer.Type + Sales.Channel + Vehicle.Size + 
##     I(Monthly.Premium.Auto^2) + I(Total.Claim.Amount^2) + I(Months.Since.Last.Claim^2), 
##     data = train, weights = wt)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.396  -0.411  -0.011   0.304  35.551 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    6.710e+00  1.761e-02 381.100  < 2e-16 ***
## Number.of.Policies2            1.404e+00  9.071e-03 154.769  < 2e-16 ***
## Number.of.Policies3            6.921e-01  5.460e-03 126.758  < 2e-16 ***
## Number.of.Policies4            6.943e-01  9.061e-03  76.625  < 2e-16 ***
## Number.of.Policies5            6.884e-01  8.706e-03  79.073  < 2e-16 ***
## Number.of.Policies6            6.932e-01  9.693e-03  71.518  < 2e-16 ***
## Number.of.Policies7            6.895e-01  8.694e-03  79.316  < 2e-16 ***
## Number.of.Policies8            6.920e-01  9.501e-03  72.827  < 2e-16 ***
## Number.of.Policies9            6.925e-01  9.081e-03  76.257  < 2e-16 ***
## Monthly.Premium.Auto           1.961e-02  3.589e-04  54.647  < 2e-16 ***
## Total.Claim.Amount            -1.594e-05  3.312e-06  -4.813 1.52e-06 ***
## Vehicle.ClassLuxury Car       -5.441e-02  4.059e-02  -1.341 0.180084    
## Vehicle.ClassLuxury SUV       -4.936e-02  3.843e-02  -1.284 0.199131    
## Vehicle.ClassSports Car       -7.377e-02  6.255e-03 -11.794  < 2e-16 ***
## Vehicle.ClassSUV              -5.227e-02  4.036e-03 -12.951  < 2e-16 ***
## Vehicle.ClassTwo-Door Car     -1.103e-03  5.179e-04  -2.129 0.033281 *  
## StateCalifornia                2.322e-03  6.367e-04   3.647 0.000268 ***
## StateNevada                    7.233e-04  1.147e-03   0.631 0.528328    
## StateOregon                    2.688e-03  6.782e-04   3.963 7.47e-05 ***
## StateWashington               -5.044e-03  8.940e-04  -5.642 1.75e-08 ***
## CoverageExtended              -7.980e-03  2.711e-03  -2.944 0.003252 ** 
## CoveragePremium               -3.711e-02  4.897e-03  -7.578 3.99e-14 ***
## EducationCollege              -2.176e-03  6.536e-04  -3.328 0.000878 ***
## EducationDoctor                3.088e-02  2.233e-03  13.832  < 2e-16 ***
## EducationHigh School or Below  1.200e-02  5.454e-04  21.995  < 2e-16 ***
## EducationMaster                2.666e-02  9.843e-04  27.081  < 2e-16 ***
## GenderM                       -1.586e-02  5.242e-04 -30.259  < 2e-16 ***
## EmploymentStatusEmployed       4.756e-02  1.518e-03  31.326  < 2e-16 ***
## EmploymentStatusMedical Leave  4.421e-03  1.799e-03   2.458 0.014005 *  
## EmploymentStatusRetired       -3.135e-02  3.795e-03  -8.260  < 2e-16 ***
## EmploymentStatusUnemployed    -5.387e-02  1.281e-03 -42.062  < 2e-16 ***
## Marital.StatusMarried          7.014e-03  7.130e-04   9.838  < 2e-16 ***
## Marital.StatusSingle          -2.363e-02  8.296e-04 -28.479  < 2e-16 ***
## Months.Since.Last.Claim       -4.059e-04  7.386e-05  -5.496 4.02e-08 ***
## Number.of.Open.Complaints1    -9.904e-03  1.336e-03  -7.411 1.40e-13 ***
## Number.of.Open.Complaints2    -2.461e-02  1.350e-03 -18.228  < 2e-16 ***
## Number.of.Open.Complaints3    -4.157e-02  2.645e-03 -15.718  < 2e-16 ***
## Number.of.Open.Complaints4    -7.428e-02  2.251e-03 -32.993  < 2e-16 ***
## Number.of.Open.Complaints5    -6.013e-02  3.853e-03 -15.608  < 2e-16 ***
## Policy.TypePersonal Auto      -6.550e-03  4.701e-04 -13.933  < 2e-16 ***
## Policy.TypeSpecial Auto       -9.119e-03  1.107e-03  -8.240  < 2e-16 ***
## Renew.Offer.TypeOffer2        -2.834e-03  7.534e-04  -3.761 0.000170 ***
## Renew.Offer.TypeOffer3        -8.123e-03  7.444e-04 -10.913  < 2e-16 ***
## Renew.Offer.TypeOffer4        -4.404e-03  6.483e-04  -6.794 1.18e-11 ***
## Sales.ChannelBranch            4.406e-03  6.045e-04   7.288 3.50e-13 ***
## Sales.ChannelCall Center       3.074e-03  7.391e-04   4.160 3.22e-05 ***
## Sales.ChannelWeb               1.331e-02  7.571e-04  17.579  < 2e-16 ***
## Vehicle.SizeMedsize            4.243e-03  5.561e-04   7.630 2.68e-14 ***
## Vehicle.SizeSmall             -3.375e-03  7.668e-04  -4.402 1.09e-05 ***
## I(Monthly.Premium.Auto^2)     -3.969e-05  1.796e-06 -22.102  < 2e-16 ***
## I(Total.Claim.Amount^2)       -7.122e-11  4.364e-09  -0.016 0.986980    
## I(Months.Since.Last.Claim^2)   1.769e-05  2.136e-06   8.281  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.663 on 6798 degrees of freedom
## Multiple R-squared:  0.9635, Adjusted R-squared:  0.9632 
## F-statistic:  3514 on 51 and 6798 DF,  p-value: < 2.2e-16

One of the key assumptions of regression is that the residuals are distributed with equal variance at each level of the predictor variable. This assumption is known as homoscedasticity. When this assumption is violated, we say that heteroscedasticity is present in the residuals. When this occurs, the results of the regression become unreliable.

We have handled this issue by using weighted polynomial regression, which places weights on the observations such that those with small error variance are given more weight since they contain more information compared to observations with larger error variance.

From summary the adj R-squared value is 0.9632

## [1] "RMSE  3884.68462974516"
## [1] "Adj R2  0.679653625844528"
## [1] "RMSE  3954.83290348664"
## [1] "Adj R2  0.664501142602053"

Adj R-squared for train is 0.679 which means that 67.9% of the variation in Customer.Lifetime.Value can be explained by the independent variables included in the model. Adj R-squared for test is 0.664 which means that 66.4% of the variation in Customer.Lifetime.Value can be explained by the independent variables included in the model.

Residual Plots without weights

Fitted vs Residual graph

Residuals plots should be random in nature and there should not be any pattern in the graph. The average of the residual plot should be close to zero. From the above plot, we can see that the red trend line is almost at zero.

Normal Q-Q Plot

Q-Q plot shows whether the residuals are normally distributed. Ideally, the plot should be on the dotted line. If the Q-Q plot is not on the line then models need to be reworked to make the residual normal. In the above plot, we see that points in the middle lie on the line whereas points in the start and towards the end does not lie on the line.

Scale-Location

This shows how the residuals are spread and whether the residuals have an equal variance or not.

Residuals vs Leverage

The plot helps to find influential observations. Here we need to check for points that are outside the dashed line. A point outside the dashed line will be influential point and removal of that will affect the regression coefficients.

So to improve the Normal Q-Q Plot we perform the weighted polynomial regression. The results after are given below:

Weighted Residual Plots

Here we can see most of the points lie on the dotted line for the Normal Q-Q plot.

5. Stepwise Regression

## 
## Call:
## lm(formula = log(Customer.Lifetime.Value) ~ Coverage + Education + 
##     EmploymentStatus + Gender + Income + Marital.Status + Monthly.Premium.Auto + 
##     Months.Since.Last.Claim + Number.of.Open.Complaints + Number.of.Policies + 
##     Vehicle.Class, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.57550 -0.05414 -0.00398  0.03782  0.91718 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    7.337e+00  2.933e-02 250.157  < 2e-16 ***
## CoverageExtended               7.481e-02  9.562e-03   7.823 5.93e-15 ***
## CoveragePremium                9.743e-02  2.005e-02   4.859 1.20e-06 ***
## EducationCollege               9.630e-04  6.569e-03   0.147  0.88346    
## EducationDoctor                3.171e-02  1.397e-02   2.271  0.02320 *  
## EducationHigh School or Below  1.957e-02  6.620e-03   2.956  0.00312 ** 
## EducationMaster                2.665e-02  9.959e-03   2.676  0.00747 ** 
## EmploymentStatusEmployed       5.524e-02  1.374e-02   4.021 5.86e-05 ***
## EmploymentStatusMedical Leave  3.356e-02  1.682e-02   1.995  0.04605 *  
## EmploymentStatusRetired        6.524e-03  1.875e-02   0.348  0.72792    
## EmploymentStatusUnemployed    -1.336e-02  1.383e-02  -0.966  0.33392    
## GenderM                       -2.435e-02  5.071e-03  -4.802 1.60e-06 ***
## Income                         3.269e-07  1.448e-07   2.259  0.02394 *  
## Marital.StatusMarried          1.953e-02  7.395e-03   2.641  0.00829 ** 
## Marital.StatusSingle          -1.586e-02  8.607e-03  -1.842  0.06546 .  
## Monthly.Premium.Auto           7.143e-03  3.750e-04  19.048  < 2e-16 ***
## Months.Since.Last.Claim        4.687e-04  2.503e-04   1.873  0.06117 .  
## Number.of.Open.Complaints1    -1.770e-02  8.214e-03  -2.155  0.03121 *  
## Number.of.Open.Complaints2    -4.200e-03  1.278e-02  -0.329  0.74238    
## Number.of.Open.Complaints3    -6.027e-02  1.440e-02  -4.186 2.87e-05 ***
## Number.of.Open.Complaints4    -9.839e-02  2.042e-02  -4.819 1.48e-06 ***
## Number.of.Open.Complaints5    -1.292e-01  3.285e-02  -3.934 8.43e-05 ***
## Number.of.Policies2            1.401e+00  6.580e-03 212.957  < 2e-16 ***
## Number.of.Policies3            6.969e-01  8.278e-03  84.183  < 2e-16 ***
## Number.of.Policies4            6.977e-01  1.255e-02  55.595  < 2e-16 ***
## Number.of.Policies5            6.931e-01  1.260e-02  55.020  < 2e-16 ***
## Number.of.Policies6            6.952e-01  1.341e-02  51.827  < 2e-16 ***
## Number.of.Policies7            6.915e-01  1.240e-02  55.768  < 2e-16 ***
## Number.of.Policies8            6.991e-01  1.285e-02  54.388  < 2e-16 ***
## Number.of.Policies9            7.002e-01  1.263e-02  55.441  < 2e-16 ***
## Vehicle.ClassLuxury Car        2.267e-02  5.319e-02   0.426  0.66997    
## Vehicle.ClassLuxury SUV        6.028e-02  5.278e-02   1.142  0.25347    
## Vehicle.ClassSports Car        1.521e-01  2.033e-02   7.484 8.10e-14 ***
## Vehicle.ClassSUV               1.442e-01  1.758e-02   8.202 2.81e-16 ***
## Vehicle.ClassTwo-Door Car      1.360e-03  6.620e-03   0.205  0.83724    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2087 on 6815 degrees of freedom
## Multiple R-squared:  0.8983, Adjusted R-squared:  0.8978 
## F-statistic:  1770 on 34 and 6815 DF,  p-value: < 2.2e-16
##                                GVIF Df GVIF^(1/(2*Df))
## Coverage                   6.140049  2        1.574139
## Education                  1.045128  4        1.005533
## EmploymentStatus           3.458399  4        1.167776
## Gender                     1.010350  1        1.005162
## Income                     3.034801  1        1.742068
## Marital.Status             1.257790  2        1.059015
## Monthly.Premium.Auto      25.830224  1        5.082344
## Months.Since.Last.Claim    1.006693  1        1.003341
## Number.of.Open.Complaints  1.031542  5        1.003110
## Number.of.Policies         1.043913  8        1.002690
## Vehicle.Class             20.892248  5        1.355185

We have tried bidirectional stepwise linear regression which does not improve the model.

6. Decision Tree

## [1] "RMSE  4031.98888684047"
## [1] "Adj R2  0.654898433137791"
## [1] "RMSE  4138.61491980185"
## [1] "Adj R2  0.632595216780728"

Decision trees provide an effective method of Decision Making because they clearly lay out the problem so that all options can be challenged. They allow us to analyze fully the possible consequences of a decision and provides a framework to quantify the values of outcomes and the probabilities of achieving them.

This model does not improve the adj R-squared value compared to the weighted polynomial regression model.

8. Conclusion

We go with weighted polynomial regression model since it gives better value for adj R-squared and RMSE.This model explains most of the variation in target variable i.e. Customer.Lifetime.Value with respect to the independent variables.

9. Solutions preferred.

  • The employment status of customers in all the 5 states does not differ significantly, however California and Oregon contribute to highest CLV values. Hence the company should try to up their game in the other states with lesser CLV values.

  • The response from the customers about the purchase of policy is negative in most cases. It is suggested that if the company invests in improving their customer service, offers provided, it will help in gaining positive response and hence acquire more customers.

  • Personal policy L3 is able to contribute to highest CLV values by attracting a large customer base. The company should try to advertise the other policies to get them recognized in the market.

  • Doctors and Master degree holders although few in number compared to other categories contribute to almost equal amounts of CLV. Hence, the company must draw the attention of these categories towards their policies. And more focus should be given to education as more and more doctors and post graduate holders will contribute to higher CLV.

  • Target customers based on their monthly premium auto and months since last claim.

  • Sales channels like call center and company websites do not contribute significantly to the CLV. So special emphasis can be given to improve these channels so that more and more customers do not have to just relay on agents to buy a policy.

10. Additional information required.

  • Customers who have no income are unemployed but still contribute to high customer lifetime value. This fact cannot be justified from the data provided. Hence additional data on secondary sources of income must be provided to validate the statement.

  • The age of the customer can contribute to the CLV value. Hence this can be explored only if the data is available. Also we can find out if policies are sanctioned to genuine customers(legal age to drive is 16 in the given states of US)

  • The date of policy purchase is necessary to understand the loyalty of the customer as the number of years they are associated with the company can be calculated and its relation with CLV can be understood.

11. Work Division

We divided the work equally amongst ourselves. Each of us had come up with our own EDA plots and tried many models. The models with better RMSE and Adj R-Squared value were selected

Teamwork is the secret that makes common people achieve uncommon results :P

12. Group-6

  • Reshma Dua (20BDA05)
  • Manu Tom (20BDA23)
  • Cleon Lobo (20BDA28)
  • Girisha Manocha (20BDA29)
  • Keerthana Sajeevan (20BDA39)
  • NagaVarun S.N (20BDA67)

13. Resources

  • Kaggle.com
  • Github.com
  • towardsdatascience.com
  • medium.com
  • R Documentation
  • statology.org
  • geeksforgeeks.com
  • datascienceplus
  • analyticsvidhya
  • R pubs
  • ISLR
  • datacamp
  • KDnuggets